In this document we learn how to create sophisticated plots with ggplot. Simply put, we are learning how to transform tidy data into visually clear graphs. In the overall context of the workflow, this falls into the category of transforming our data into data visualisations.
{{<expand "Note: LinkedIn Learning videos" "...">}} There are references to LinkedIn Learning videos. These are complementary but not really required as the notes below are meant to be self-contained. Some students and staff would have access for free. Do not purchase access unless you are sure you don’t have access through your organisation already. {{</expand>}}
library("tidyverse")
We use the “add” (+) operator to create charts.
The ‘Grammar of Graphics’ is just a method of describing the graphs we create. It is more relevant to ggplot than to other packages we may use because of the unique syntax required by ggplot.
For every visualisation we wish to create, we must consider its properties.
In this example, both graphs display the same data, but a bar graph is very different from a scatterplot. In the bar graph, the size of a bar represents the y-value of our data, whereas in the scatterplot this is represented by the height of a point. We need a language to describe what is going on here.
We describe the components of a visualisation as follows:
| Component of Visualisation | Definition | Examples |
|---|---|---|
| Data | The set of data we want to visualise | Height, weight, number of… |
| Geometries | Shapes we use to visualise data | Lines, bubbles, regions… |
| Aesthetics | Properties of geometries | Thickness, size, colour… |
| Scales | Mappings between geometries and aesthetics | Thickness of a line, size of a bubble, colour of a region… |
Now we have the language we need to fully describe visualisations. One uses a point-and-line geometry, with vertical and horizontal aesthetics reflecting x- and y-values, and one uses a bar geometry, describing y-values using the height aesthetic and x-values using the label aesthetic.
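As a rough sketch of how this vocabulary maps onto code, the snippet below draws one small data frame with two different geometries. The data frame `toy` and its columns are hypothetical and are not part of the course data.

```r
library(ggplot2)

# Hypothetical toy data (not from the course datasets)
toy <- data.frame(label = c("a", "b", "c"), value = c(2, 5, 3))

# Same data, same x and y aesthetics, two different geometries
scatter <- ggplot(data = toy) + geom_point(aes(x = label, y = value))
bars    <- ggplot(data = toy) + geom_col(aes(x = label, y = value))
```

Only the geometry function changes between the two plots; the data and the aesthetics stay the same.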
With ggplot we begin every plot with ggplot(). The ggplot() function has a data argument.
We can add data in two ways:
ggplot(data = sample_data)
sample_data %>%
ggplot()
We note that no data is showing up. This is because we haven’t specified any method of graphing using the “add” (+) operator.
ggplot works by creating a plot with ggplot() and then ‘add’ing to it with the add (+) operator. If we add the geom_point() function to our ggplot using +, we will get a plot.
sample_data %>%
ggplot() +
geom_point(aes(x = V1,
y = V2))
Each use of the add (+) operator represents adding a geometry. We can add a second geometry with the geom_line() function.
sample_data %>%
ggplot() +
geom_point(aes(x = V1,
y = V2)) +
geom_line(aes(x = V1,
y = V2))
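The layering idea can be checked directly on the plot objects. This is a minimal sketch that uses a hypothetical stand-in for sample_data (the columns V1 and V2 are assumed from the examples above; the values are invented).

```r
library(ggplot2)

# Hypothetical stand-in for sample_data (values are made up)
toy <- data.frame(V1 = 1:4, V2 = c(1, 3, 2, 4))

canvas      <- ggplot(data = toy)                              # no geometry yet: blank plot
with_points <- canvas + geom_point(aes(x = V1, y = V2))        # one layer
with_both   <- with_points + geom_line(aes(x = V1, y = V2))    # two layers

# Number of layers held by each plot object
layer_counts <- c(length(canvas$layers),
                  length(with_points$layers),
                  length(with_both$layers))
```

Each `+` appends one layer, which is why a bare ggplot() call draws nothing.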
We will be using the ACORN-SAT temperature data for our examples.
load("tidy_ACORN-SAT_data/station_data.rdata")
sample_temperature <- station_data %>%
filter(Station.name == "Sydney"
| Station.name == "Darwin"
| Station.name == "Adelaide"
| Station.name == "Alice Springs")
head(sample_temperature, 5)
## Number year average.temp Station.name Latitude Longitude Elevation Start
## 1 14015 1910 26.8 Darwin -12.42 130.89 30 1910
## 2 15590 1910 20.2 Alice Springs -23.80 133.89 546 1910
## 3 23090 1910 15.9 Adelaide -34.92 138.62 48 1910
## 4 66062 1910 18.0 Sydney -33.86 151.21 39 1910
## 5 14015 1911 26.5 Darwin -12.42 130.89 30 1910
We already know we choose data to plot using ggplot(). We know we can choose geometries using various other functions. Now we need to know how to specify aesthetics.
Geometry functions have a mapping argument, which we set using the aes() function. aes() itself has arguments which control our aesthetics.
For a scatterplot our geometry function is geom_point(), and aes() has these arguments:
| Argument | Description |
|---|---|
| x | Data for the x-axis |
| y | Data for the y-axis |
| shape | Shape of points |
| color | Colour of points |
| size | Size of points |
There is also alpha, which controls the opacity of points. To give every point the same opacity, supply it as an argument of geom_point() rather than inside aes().
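A small sketch of a fixed opacity, assuming a hypothetical data frame `toy` with columns V1 and V2: x and y are mapped to data inside aes(), while alpha is a constant passed to geom_point().

```r
library(ggplot2)

# Hypothetical data (values are made up)
toy <- data.frame(V1 = 1:4, V2 = c(1, 3, 2, 4))

faded <- ggplot(data = toy) +
  geom_point(aes(x = V1, y = V2),  # aesthetics mapped to columns
             alpha = 0.3)          # constant: every point is 30% opaque
```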
sample_temperature %>%
ggplot() +
geom_point(aes(x = year,
y = average.temp))
sample_temperature %>%
ggplot() +
geom_point(aes(x = year,
y = average.temp,
color = Station.name))
geom_line() is much like geom_point(), but it is used to create line graphs.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp))
Did the plot not work?
When we use geom_line() with multiple categories, we must tell ggplot how to group them somehow. We do this with the group argument of aes().
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name))
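The effect of group can be seen on a tiny example. This sketch assumes that year is stored as text (a discrete variable); the data frame `toy` is hypothetical and most of its values are invented.

```r
library(ggplot2)

# Hypothetical data: two stations, three years each, year stored as text
toy <- data.frame(
  year    = rep(c("1910", "1911", "1912"), times = 2),
  temp    = c(18.0, 17.8, 18.3, 26.8, 26.5, 27.0),
  station = rep(c("Sydney", "Darwin"), each = 3)
)

lines <- ggplot(data = toy) +
  geom_line(aes(x = year, y = temp,
                colour = station,
                group  = station))   # one line per station

# Inspect the data ggplot computed for the layer
built <- layer_data(lines)
n_groups <- length(unique(built$group))
```

With group supplied, ggplot connects the points within each station instead of treating every x value as its own group.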
geom_smooth() is a new geometry function.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name))
We can remove the shaded confidence band by setting the se argument to FALSE.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name),
se = FALSE)
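A self-contained sketch of se = FALSE follows. It fits a straight line (method = "lm") only to keep the example small and deterministic; that choice, the data frame `toy` and its values are assumptions, not what the plots above use.

```r
library(ggplot2)

# Hypothetical data: ten years of made-up temperatures
toy <- data.frame(year = 1910:1919,
                  temp = c(18.0, 17.8, 18.3, 17.9, 18.4,
                           18.1, 18.6, 18.2, 18.7, 18.5))

trend <- ggplot(data = toy) +
  geom_smooth(aes(x = year, y = temp),
              method  = "lm",     # assumption: linear fit for simplicity
              formula = y ~ x,
              se      = FALSE)    # no confidence band

fitted_line <- layer_data(trend)  # points along the fitted line
```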
So far we have seen the aes() function as an argument of ‘geom’ functions, and we will continue to see this throughout this material. It is worth noting, however, that the aes() function can be used as an argument of ggplot() instead of the geom functions. As an example consider this plot with multiple geometries:
sample_data %>%
ggplot() +
geom_point(aes(x = V1,
y = V2)) +
geom_line(aes(x = V1,
y = V2))
This can be rewritten using only one use of aes() if we use it as an argument of ggplot() instead of both geom_point() and geom_line():
sample_data %>%
ggplot(aes(x = V1,
y = V2)) +
geom_point() +
geom_line()
We get the same effect. Note that if we then used aes() within geom_point() (for example), the new aesthetics we supply would override the ggplot() aesthetics.
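The override behaviour can be sketched as follows, using a hypothetical data frame with an extra column V3. The point layer inherits both aesthetics from ggplot(), while the line layer replaces y for itself only.

```r
library(ggplot2)

# Hypothetical data (values are made up)
toy <- data.frame(V1 = 1:4, V2 = c(1, 3, 2, 4), V3 = c(4, 2, 3, 1))

p <- ggplot(data = toy, aes(x = V1, y = V2)) +
  geom_point() +              # inherits x = V1, y = V2
  geom_line(aes(y = V3))      # inherits x = V1, overrides y with V3

point_y <- layer_data(p, 1)$y
line_y  <- layer_data(p, 2)$y
```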
Bar charts can be made with two geometry functions: geom_bar() and geom_col().
With the geom_bar() function, the user specifies an x variable to graph, and the function takes the y variable to be a count of observations of each x value.
hot_or_cold <- station_data %>%
filter(Station.name == "Sydney") %>%
mutate(warmth = ifelse(average.temp > 18, "hot", "cold"))
hot_or_cold %>%
ggplot() +
geom_bar(aes(x = warmth))
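To see the counting that geom_bar() does, here is a minimal sketch on a hypothetical five-row data frame. layer_data() returns the values ggplot computed for the layer.

```r
library(ggplot2)

# Hypothetical data: five observations of a categorical variable
toy <- data.frame(warmth = c("hot", "cold", "hot", "hot", "cold"))

bars <- ggplot(data = toy) +
  geom_bar(aes(x = warmth))

# One row per category, in alphabetical order: cold, then hot
bar_counts <- layer_data(bars)$count
```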
The geom_col() function is more versatile, allowing the user to specify both an x and a y variable. The x is usually categorical, and the y quantitative.
We introduce the Australian Environmental-Economic Accounts (2016) for our examples.
load("tidy_EnvAcc_data/consumption.rdata")
head(consumption, 12)
## # A tibble: 12 x 3
## State year water_consumption
## <chr> <chr> <dbl>
## 1 NSW 2008–09 4555
## 2 VIC 2008–09 2951
## 3 QLD 2008–09 3341
## 4 SA 2008–09 1179
## 5 WA 2008–09 1361
## 6 TAS 2008–09 466
## 7 NT 2008–09 160
## 8 ACT 2008–09 48
## 9 NSW 2009–10 4323
## 10 VIC 2009–10 2904
## 11 QLD 2009–10 3112
## 12 SA 2009–10 1110
consumption %>%
ggplot() +
geom_col(aes(x = year,
y = water_consumption))
We can use the color argument of aes() to colour by state.
consumption %>%
ggplot() +
geom_col(aes(x = year,
y = water_consumption,
color = State))
The color argument only outlines the bars, so we use the fill argument to create solid colour.
consumption %>%
ggplot() +
geom_col(aes(x = year,
y = water_consumption,
fill = State))
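The difference between the two arguments can be sketched on four rows copied from the table above (NSW and VIC only). The object names are hypothetical.

```r
library(ggplot2)

# Four rows taken from the consumption table shown earlier
toy <- data.frame(
  State             = c("NSW", "VIC", "NSW", "VIC"),
  year              = c("2008–09", "2008–09", "2009–10", "2009–10"),
  water_consumption = c(4555, 2951, 4323, 2904)
)

outlined <- ggplot(data = toy) +
  geom_col(aes(x = year, y = water_consumption, colour = State))  # coloured borders
filled <- ggplot(data = toy) +
  geom_col(aes(x = year, y = water_consumption, fill = State))    # coloured interiors

# How many distinct interior colours does each version use?
n_fill_outlined <- length(unique(layer_data(outlined)$fill))
n_fill_filled   <- length(unique(layer_data(filled)$fill))
```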
Histograms in ggplot are very flexible. We use the geom_histogram() function to specify a histogram geometry. It has several arguments (most of which are arguments of geom_histogram(), and not aes()):
- The x argument selects the variable to plot
- The bins argument can choose the number of bins we use OR
- The binwidth argument can specify the width of the bins we place our observations in
- The origin argument (replaced by boundary in recent versions of ggplot2) specifies the number from which we begin setting bins and is only useful with the bins argument

We again use annual temperatures in Sydney as an example.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
bins = 10)
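We can use origin to ensure the first bin begins counting from 16.5 degrees. Recent versions of ggplot2 deprecate origin in favour of boundary, which is used here.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
bins = 10,
boundary = 16.5)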
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5)
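The two ways of choosing bins can be compared on a small hypothetical vector of temperatures. This sketch uses boundary, the argument current ggplot2 provides for anchoring bin edges; the data values are invented.

```r
library(ggplot2)

# Hypothetical temperatures (not the Sydney data)
temps <- data.frame(average.temp = c(16.6, 17.1, 17.4, 17.8, 17.9, 18.2, 18.9))

by_count <- ggplot(data = temps) +
  geom_histogram(aes(x = average.temp), bins = 5)        # choose how many bins

by_width <- ggplot(data = temps) +
  geom_histogram(aes(x = average.temp),
                 binwidth = 0.5,                          # choose how wide each bin is
                 boundary = 16.5)                         # bin edges fall on 16.5, 17.0, ...

bins_by_width <- layer_data(by_width)   # xmin, xmax and count for every bin
```

Either way, every observation lands in exactly one bin, so the counts always sum to the number of rows.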
Boxplots use the geom_boxplot() geometry function, which takes both an x and a y variable. x must be a categorical variable, and a box is drawn for each category of x based on the distribution of our y data.
We can, as an example, examine how the average yearly temperatures of Sydney, Darwin, Adelaide and Alice Springs compare.
sample_temperature %>%
ggplot() +
geom_boxplot(aes(x = Station.name,
y = average.temp,
fill = Station.name))
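What a box summarises can be read back from the plot. The sketch below uses hypothetical temperatures for two stations and extracts the median that each box draws as its middle line.

```r
library(ggplot2)

# Hypothetical data: five made-up yearly averages per station
toy <- data.frame(
  station = rep(c("Darwin", "Sydney"), each = 5),
  temp    = c(26, 26.5, 27, 27.5, 28,
              17, 17.5, 18, 18.5, 19)
)

box <- ggplot(data = toy) +
  geom_boxplot(aes(x = station, y = temp, fill = station))

# One row per category of x; `middle` is the median of y in that category
medians <- layer_data(box)$middle
```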
As was stated before, ggplot is a package which allows for extremely detailed tweaking of graphs. This includes the ability to modify, create and delete:
We make such modifications using the theme() function. This function can take many different arguments:
- plot.background
- panel.background
- panel.grid.major
- panel.grid.minor
- legend.key
- axis.ticks
- axis.title

Each relates to one element of a visualisation, and we consider each in turn.
For our tweaking examples, we’ll use our histogram:
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5)
We introduce the theme() function and discover we can modify several elements.
To modify the plot background we use plot.background and set this to the element_rect() function. element_rect() has a colour argument which we can control, making the border a colour of our choice.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(plot.background = element_rect(colour = "red"))
There is also a fill argument, which fills the entire background.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(plot.background = element_rect(fill = "red"))
To modify the panel background we use panel.background and set this to the element_rect() function. Again, we can set colour or fill.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(panel.background = element_rect(colour = "red"))
Or…
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(panel.background = element_rect(fill = "red"))
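To modify grid lines we use panel.grid.major and panel.grid.minor, and set these to the element_line() function. Its arguments include colour and size (named linewidth in recent versions of ggplot2).
Here are some examples: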
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(panel.grid.major = element_line(colour = "blue",
size = 5))
We can modify the vertical and horizontal major grid lines separately using panel.grid.major.x and panel.grid.major.y respectively.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(panel.grid.major.x = element_line(colour = "blue"),
panel.grid.major.y = element_line(colour = "green"))
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(panel.grid.minor = element_line(colour = "red"))
We can remove elements entirely by setting them to the element_blank() function.
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
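The three element functions used so far can be combined in a single theme() call. This is a minimal sketch on hypothetical data; the colours are arbitrary choices.

```r
library(ggplot2)

# Hypothetical temperatures (not the Sydney data)
temps <- data.frame(average.temp = c(16.6, 17.1, 17.4, 17.8, 17.9, 18.2, 18.9))

themed <- ggplot(data = temps) +
  geom_histogram(aes(x = average.temp), binwidth = 0.5) +
  theme(panel.background = element_rect(fill = "white", colour = "black"),  # rectangle
        panel.grid.major = element_line(colour = "grey80"),                 # lines
        panel.grid.minor = element_blank())                                 # removed

# Build the drawable object to confirm the theme settings are accepted
themed_grob <- ggplotGrob(themed)
```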
Combining all of these features can produce some cool looking graphs! If you have HTML colour codes, you can pick your own colours!
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme(plot.background = element_rect(colour = "#000000",
fill = "#A6FEFD"),
panel.background = element_rect(colour = "#000000",
fill = "#FFEEEE"),
panel.grid.major = element_line(size = 0.5,
colour = "#FF8383"),
panel.grid.minor = element_line(colour = "#B0FCFF"))
xlab() and ylab() can change the labels of axes.
xlim() and ylim() can change the range of what the axes display.
It’s simple enough to change axis names:
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
xlab("Average Yearly Temp. (°C)") +
ylab("Number of Observations (1910-2012)")
As an example of axis range, observe the contrast in the axes for the following:
sample_data %>%
ggplot() +
geom_point(aes(x = V1,
y = V2)) +
geom_line(aes(x = V1,
y = V2))
Versus…
sample_data %>%
ggplot() +
geom_point(aes(x = V1,
y = V2)) +
geom_line(aes(x = V1,
y = V2)) +
xlim(0,5) +
ylim(0,5)
We can also reduce the axis range, but this may cause a loss of data.
sample_data %>%
ggplot() +
geom_point(aes(x = V1,
y = V2)) +
geom_line(aes(x = V1,
y = V2)) +
xlim(0,3) +
ylim(0,3)
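The loss of data can be made visible by inspecting what ggplot keeps after the limits are applied. This sketch uses a hypothetical stand-in for sample_data; values outside the limits are turned into NA and are then dropped when the plot is drawn.

```r
library(ggplot2)

# Hypothetical stand-in for sample_data; the last point lies outside 0..3
toy <- data.frame(V1 = 1:4, V2 = c(1, 3, 2, 4))

clipped <- ggplot(data = toy) +
  geom_point(aes(x = V1, y = V2)) +
  xlim(0, 3) +
  ylim(0, 3)

kept <- suppressWarnings(layer_data(clipped))
n_lost <- sum(is.na(kept$x) | is.na(kept$y))   # points that will not be drawn
```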
Finally, we can take care of this awfully scaled plot!
We introduce the concept of scale functions.
Some examples of scale functions:
- scale_x_discrete()
- scale_y_continuous()
- scale_size_continuous()
- scale_fill_manual()
- scale_colour_gradient()

| Argument | Description |
|---|---|
| name | Changes axis names (just like xlab()) |
| limits | Changes the axis range with precise control |
| breaks | Changes the way numbers are displayed on our scale with precise control |
First of all, we rename axes with name and rescale our range. Note: limits and breaks take a vector of numbers as their value. We usually use seq() to assign to breaks (specifying the start, end, and step-value) whilst we usually use a vector to assign to limits (specifying only a start and end value).
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
scale_x_discrete(name = "Year",
breaks = seq(1910, 2012, 10)) +
scale_y_continuous(name = "Yearly Temp. (°C)",
limits = c(15, 30))
Now this looks much nicer!
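The way seq() and c() feed breaks and limits can be sketched in isolation. The data frame `toy` is hypothetical, and only the y scale is customised here to keep the example short.

```r
library(ggplot2)

# seq(start, end, step) builds the vector of break positions
decade_breaks <- seq(1910, 2012, 10)   # 1910, 1920, ..., 2010

# Hypothetical data with year stored as text
toy <- data.frame(year = c("1910", "1920", "1930"),
                  temp = c(18.0, 17.6, 18.4))

scaled <- ggplot(data = toy) +
  geom_point(aes(x = year, y = temp)) +
  scale_y_continuous(name   = "Yearly Temp. (°C)",
                     limits = c(15, 30),          # start and end only
                     breaks = seq(15, 30, 5))     # tick marks at 15, 20, 25, 30

y_scale <- scaled$scales$get_scales("y")
```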
We can add the scale_fill_manual() or scale_colour_manual() functions to our plot to fine tune it. If we colour our data using the fill argument, use the first, and if we colour our data using the colour argument, use the second. Both have two important arguments: values and guide. The values argument takes a vector of colours and can be used to create custom colours in a chart.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
scale_x_discrete(name = "Year",
breaks = seq(1910, 2012, 10)) +
scale_y_continuous(name = "Yearly Temp. (°C)",
limits = c(15, 30)) +
scale_colour_manual(values = c("#FF0000", "#0031FF", "#FF00EB", "#13BB00"))
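An unnamed values vector is matched to categories in order. A named vector, as in this sketch, pins each colour to a specific category. The data frame `toy` is hypothetical.

```r
library(ggplot2)

# Hypothetical data: two stations, year stored as text
toy <- data.frame(
  year    = rep(c("1910", "1911", "1912"), times = 2),
  temp    = c(18.0, 17.8, 18.3, 26.8, 26.5, 27.0),
  station = rep(c("Sydney", "Darwin"), each = 3)
)

coloured <- ggplot(data = toy) +
  geom_line(aes(x = year, y = temp, colour = station, group = station)) +
  scale_colour_manual(values = c(Darwin = "#FF0000",    # names tie colours
                                 Sydney = "#0031FF"))   # to categories

built <- layer_data(coloured)
darwin_colour <- unique(built$colour[built$group == 1])  # group 1 is Darwin (alphabetical)
```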
The guide argument of the above two functions itself takes a function as its value. The function is called guide_legend().
Arguments of guide_legend():

| Argument | Description |
|---|---|
| title | Changes legend title |
| nrow | Changes how many rows the legend uses to display data |
| label.position | Changes where the label title appears in the legend |
| keywidth | Numeric argument to modify the width of legend boxes |
| legend.key | Takes value of a vector of colours and specifies custom colours |
Let’s use all of the above on our chart.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
scale_x_discrete(name = "Year",
breaks = seq(1910, 2012, 10)) +
scale_y_continuous(name = "Yearly Temp. (°C)",
limits = c(15, 30)) +
scale_colour_manual(values = c("#FF0000", "#0031FF", "#FF00EB", "#13BB00"),
guide = guide_legend(title = "Location",
nrow = 2,
label.position = "top",
keywidth = 2.5))
We can use the legend.position argument of the theme() function to change where our legend is.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
scale_x_discrete(name = "Year",
breaks = seq(1910, 2012, 10)) +
scale_y_continuous(name = "Yearly Temp. (°C)",
limits = c(15, 30)) +
scale_colour_manual(values = c("#FF0000", "#0031FF", "#FF00EB", "#13BB00"),
guide = guide_legend(title = "Location",
nrow = 1,
label.position = "top",
keywidth = 2.5)) +
theme(legend.position = "top")
Note: legend.position = "none" removes a legend entirely.
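A compact sketch of legend control on hypothetical data: guide_legend() sets the title and row count, and theme() sets the position. Only the title and nrow arguments are used here.

```r
library(ggplot2)

# Hypothetical data: two stations, year stored as text
toy <- data.frame(
  year    = rep(c("1910", "1911", "1912"), times = 2),
  temp    = c(18.0, 17.8, 18.3, 26.8, 26.5, 27.0),
  station = rep(c("Sydney", "Darwin"), each = 3)
)

with_legend <- ggplot(data = toy) +
  geom_line(aes(x = year, y = temp, colour = station, group = station)) +
  scale_colour_manual(values = c("#FF0000", "#0031FF"),
                      guide  = guide_legend(title = "Location", nrow = 1)) +
  theme(legend.position = "bottom")   # "top", "left", "right" or "none" also work

legend_grob <- ggplotGrob(with_legend)  # build the drawable object
```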
We can add text to a chart with the annotate() function. It takes a label argument, being the text of your annotation, and x and y arguments for position on a chart.
sample_temperature %>%
ggplot() +
geom_line(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
geom_smooth(aes(x = year,
y = average.temp,
color = Station.name,
group = Station.name)) +
scale_x_discrete(name = "Year",
breaks = seq(1910, 2012, 10)) +
scale_y_continuous(name = "Yearly Temp. (°C)",
limits = c(15, 30)) +
scale_colour_manual(values = c("#FF0000", "#0031FF", "#FF00EB", "#13BB00")) +
theme(legend.position = "none") +
annotate("text",
label = "Adelaide",
x = 90,
y = 16.5) +
annotate("text",
label = "Sydney",
x = 90,
y = 18.5) +
annotate("text",
label = "Alice Springs",
x = 90,
y = 22) +
annotate("text",
label = "Darwin",
x = 90,
y = 27.5)
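When the x-axis is discrete, the x given to annotate() is a position along the axis (first category = 1, second = 2, and so on) rather than a data value. A minimal sketch with a hypothetical three-year data frame:

```r
library(ggplot2)

# Hypothetical data: one station, year stored as text
toy <- data.frame(year    = c("1910", "1911", "1912"),
                  temp    = c(26.8, 26.5, 27.0),
                  station = "Darwin")

labelled <- ggplot(data = toy) +
  geom_line(aes(x = year, y = temp, group = station)) +
  annotate("text",
           label = "Darwin",
           x = 3,        # third category on the x-axis, i.e. "1912"
           y = 27.5)

annotation <- layer_data(labelled, 2)   # the annotation is the second layer
```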
We can add horizontal or vertical lines with the geom_hline() or geom_vline() functions. These take the yintercept or xintercept arguments respectively and can be used to place lines at any point on a chart.
consumption %>%
ggplot() +
geom_col(aes(x = year,
y = water_consumption,
fill = State)) +
ylim(0,22000) +
annotate("text",
label = "Peak",
x = 5,
y = 21000) +
geom_hline(yintercept = 19756)
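Rather than typing the line's height by hand, it can be computed from the data. This sketch uses four rows copied from the consumption table (NSW and VIC only), so its peak differs from the full-data value used above.

```r
library(ggplot2)

# Four rows taken from the consumption table shown earlier
toy <- data.frame(
  State             = c("NSW", "VIC", "NSW", "VIC"),
  year              = c("2008–09", "2008–09", "2009–10", "2009–10"),
  water_consumption = c(4555, 2951, 4323, 2904)
)

# Total per year, then the largest total
totals <- tapply(toy$water_consumption, toy$year, sum)
peak   <- max(totals)

with_peak <- ggplot(data = toy) +
  geom_col(aes(x = year, y = water_consumption, fill = State)) +
  geom_hline(yintercept = peak)   # horizontal line at the tallest stacked bar
```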
We can add a title with ggtitle(), and a subtitle with the subtitle argument of ggtitle().
consumption %>%
ggplot() +
geom_col(aes(x = year,
y = water_consumption,
fill = State)) +
ggtitle("Water Consumption of Australia over Time",
subtitle = "Data provided by the Australian Environmental-Economic Accounts, 2016")
library("ggthemes")
station_data %>%
filter(Station.name == "Sydney") %>%
ggplot() +
geom_histogram(aes(x = average.temp),
binwidth = 0.5) +
theme_solarized()
Themes built into ggplot:
- theme_bw()
- theme_dark()
- theme_void()
- theme_minimal()

Themes from the ggthemes package (must be installed and library’d):
- theme_solarized()
- theme_excel()
- theme_wsj()
- theme_economist()
- theme_fivethirtyeight()

Here’s the ggplot Cheat Sheet published by RStudio!
ggplot and its sibling package ggmap are used to render maps and map-based charts in R.
Unfortunately, as of June 2018, Google updated its policy on API keys, significantly limiting the capabilities of ggmap for individuals who have not signed up to Google APIs with billing details. As a result the content of Section 4 is not considered in this course.
The LinkedIn Learning tutorial provides a practice dataset, along with a challenge. Given the dataset, the challenge is:
Good luck!